Method for time aligning audio signals using characterizations based on auditory events

ABSTRACT

A method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived from another signal, comprises deriving reduced-information characterizations of the audio signals based on auditory scene analysis. The time offset of one characterization with respect to the other characterization is calculated, and the temporal relationship of the audio signals with respect to each other is modified in response to the time offset such that the audio signals are coincident with each other. These principles may also be applied to a method for time aligning a video signal and an audio signal that will be subjected to differential time offsets.

TECHNICAL FIELD

[0001] The invention relates to audio signals. More particularly, the invention relates to characterizing audio signals and using characterizations to time align or synchronize audio signals wherein one signal has been derived from the other or in which both have been derived from the same other signal. Such synchronization is useful, for example, in restoring television audio to video synchronization (lip-sync) and in detecting a watermark embedded in an audio signal (the watermarked signal is compared to an unwatermarked version of the signal). The invention may be implemented so that a low processing power process brings two such audio signals into substantial temporal alignment.

BACKGROUND ART

[0002] The division of sounds into units perceived as separate is sometimes referred to as “auditory event analysis” or “auditory scene analysis” (“ASA”). An extensive discussion of auditory scene analysis is set forth by Albert S. Bregman in his book Auditory Scene Analysis—The Perceptual Organization of Sound, Massachusetts Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback edition. In addition, U.S. Pat. No. 6,002,776 to Bhadkamkar, et al, Dec. 14, 1999 cites publications dating back to 1976 as “prior art work related to sound separation by auditory scene analysis.” However, the Bhadkamkar, et al patent discourages the practical use of auditory scene analysis, concluding that “[t]echniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made.”

[0003] Bregman notes in one passage that “[w]e hear discrete units when the sound changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space.” (Auditory Scene Analysis—The Perceptual Organization of Sound, supra at page 469). Bregman also discusses the perception of multiple simultaneous sound streams when, for example, they are separated in frequency.

[0004] There are many different methods for extracting characteristics or features from audio. Provided the features or characteristics are suitably defined, their extraction can be performed using automated processes. For example, “ISO/IEC JTC 1/SC 29/WG 11” (MPEG) is currently standardizing a variety of audio descriptors as part of the MPEG-7 standard. A common shortcoming of such methods is that they ignore ASA. Such methods seek to measure, periodically, certain “classical” signal processing parameters such as pitch, amplitude, power, harmonic structure and spectral flatness. Such parameters, while providing useful information, do not analyze and characterize audio signals into elements perceived as separate according to human cognition.

[0005] Auditory scene analysis attempts to characterize audio signals in a manner similar to human perception by identifying elements that are separate according to human cognition. By developing such methods, one can implement automated processes that accurately perform tasks that heretofore would have required human assistance.

[0006] The identification of separately perceived elements would allow the unique identification of an audio signal using substantially less information than the full signal itself. Compact and unique identifications based on auditory events may be employed, for example, to identify a signal that is copied from another signal (or is copied from the same original signal as another signal).

DISCLOSURE OF THE INVENTION

[0007] A method is described that generates a unique reduced-information characterization of an audio signal that may be used to identify the audio signal. The characterization may be considered a “signature” or “fingerprint” of the audio signal. According to the present invention, an auditory scene analysis (ASA) is performed to identify auditory events as the basis for characterizing an audio signal. Ideally, the auditory scene analysis identifies auditory events that are most likely to be perceived by a human listener even after the audio has undergone processing, such as low bit rate coding or acoustic transmission through a loudspeaker. The audio signal may be characterized by the boundary locations of auditory events and, optionally, by the dominant frequency subband of each auditory event. The resulting information pattern constitutes a compact audio fingerprint or signature that may be compared to the fingerprint or signature of a related audio signal to determine quickly and/or with low processing power the time offset between the original audio signals. The reduced-information characteristics have substantially the same relative timing as the audio signals they represent.

[0008] The auditory scene analysis method according to the present invention provides a fast and accurate method of time aligning two audio signals, particularly music, by comparing signatures containing auditory event information. ASA extracts information underlying the perception of similarity, in contrast to traditional methods that extract features less fundamental to perceiving similarities between audio signals (such as pitch, amplitude, power, and harmonic structure). The use of ASA improves the chance of finding similarity in, and hence time aligning, material that has undergone significant processing, such as low bit rate coding or acoustic transmission through a loudspeaker.

[0009] In the embodiments discussed below, it is assumed that the two audio signals under discussion are derived from a common source. The method of the present invention determines the time offset of one such audio signal with respect to the other so that they may be brought into approximate synchronism with respect to each other.

[0010] Although in principle the invention may be practiced either in the analog or digital domain (or some combination of the two), in practical embodiments of the invention, audio signals are represented by samples in blocks of data and processing is done in the digital domain.

[0011] Referring to FIG. 1A, auditory scene analysis 2 is applied to an audio signal in order to produce a “signature” or “fingerprint” related to that signal. In this case, there are two audio signals of interest. They are similar in that one is derived from the other or both have been previously derived from the same original signal. Thus, auditory scene analysis is applied to both signals. For simplicity, FIG. 1A shows only the application of ASA to one signal. As shown in FIG. 1B, the signatures for the two audio signals, Signature 1 and Signature 2, are applied to a time offset calculation function 4 that calculates an “offset” output that is a measure of the relative time offset between the two signatures.

[0012] Because the signatures are representative of the audio signals but are substantially shorter (i.e., they are more compact or have fewer bits) than the audio signals from which they were derived, the time offset between the signatures can be determined in much less time than it would take to determine the time offset between the audio signals. Moreover, because the signatures retain substantially the same relative timing relationship as the audio signals from which they are derived, a calculation of the offset between the signatures is usable to time align the original audio signals. Thus, the offset output of function 4 is applied to a time alignment function 6. The time alignment function also receives the two audio signals, Audio signal 1 and Audio signal 2 (from which Signatures 1 and 2 were derived), and provides two audio signal outputs, Audio signal 3 and Audio signal 4. It is desired to adjust the relative timing of Audio signal 1 with respect to Audio signal 2 so that they are in time alignment (synchronism) or are nearly in time alignment. To accomplish this, one may be time shifted with respect to the other or, in principle, both may be time shifted. In practice, one of the output audio signals is a “pass through” of Audio signal 1 or Audio signal 2 (i.e., it is substantially the same signal) and the other is a time shifted version of the other audio signal that has been temporally modified so that Audio Signal 3 and Audio Signal 4 are in time synchronism or nearly in time synchronism with each other, depending on the resolution accuracy of the offset calculation and time alignment functions. If greater alignment accuracy is desired, further processing may be applied to Audio Signal 3 and/or Audio Signal 4 by one or more other processes that form no part of the present invention.

[0013] The time alignment of the signals may be useful, for example, in restoring television audio to video synchronization (lip-sync) and in detecting a watermark embedded in an audio signal. In the former case, a signature of the audio is embedded in the video signal prior to transmission or storage processes that may result in the audio and video getting out of synchronism. At a reproduction point, a signature may be derived from the audio signal and compared to the signature embedded in the video signal in order to restore their synchronism. Systems of that type not employing characterizations based on auditory scene analysis are described in U.S. Pat. Nos. Re 33,535, 5,202,761, 6,211,919, and 6,246,439, all of which are incorporated herein by reference in their entireties. In the second case, an original version of an audio signal is compared to a watermarked version of the audio signal in order to recover the watermark. Such recovery requires close temporal alignment of the two audio signals. This may be achieved, at least to a first degree of alignment, by deriving a signature of each audio signal to aid in time alignment of the original audio signals, as explained herein. Further details of FIGS. 1A and 1B are set forth below.

[0014] For some applications, the processes of FIGS. 1A and 1B should be real-time. For other applications, they need not be real-time. In a real-time application, the process stores a history (a few seconds, for example) of the auditory scene analysis for each input signal. Periodically, that event history is employed to update the offset calculation in order to continually correct the time offset. The auditory scene analysis information for each of the input signals may be generated in real time, or the information for either of the signals may already be present (assuming that some offline auditory scene analysis processing has already been performed). One use for a real-time system is, for example, an audio/video aligner as mentioned above. One series of event boundaries is derived from the audio; the other series of event boundaries is recovered from the video (assuming some previous embedding of the audio event boundaries into the video). The two event boundary sequences can be periodically compared to determine the time offset between the audio and video in order to improve the lip sync, for example.

[0015] Thus, both signatures may be generated from the audio signals at nearly the same time that the time offset of the signatures is calculated and used to modify the alignment of the audio signals to achieve their substantial coincidence. Alternatively, one of the signatures to be compared may be carried along with the audio signal from which it was derived, for example, by embedding the signature in another signal, such as a video signal as in the case of audio and video alignment just described. As a further alternative, both signatures may be generated in advance and only the comparison and timing modification performed in real time. For example, in the case of two sources of the same television program (with both video and audio), both with embedded audio signatures, the respective television signals (with accompanying audio) could be synchronized (both video and audio) by comparing the recovered signatures. The relative timing relationship of the video and audio in each television signal would remain unaltered. The television signal synchronization would occur in real time, but neither signature would be generated at that time nor simultaneously with the other.

[0016] In accordance with aspects of the present invention, a computationally efficient process for dividing audio into temporal segments or “auditory events” that tend to be perceived as separate is provided.

[0017] A powerful indicator of the beginning or end of a perceived auditory event is believed to be a change in spectral content. In order to detect changes in timbre and pitch (spectral content) and, as an ancillary result, certain changes in amplitude, the audio event detection process according to an aspect of the present invention detects changes in spectral composition with respect to time. Optionally, according to a further aspect of the present invention, the process may also detect changes in amplitude with respect to time that would not be detected by detecting changes in spectral composition with respect to time.

[0018] In its least computationally demanding implementation, the process divides audio into time segments by analyzing the entire frequency band of the audio signal (full bandwidth audio) or substantially the entire frequency band (in practical implementations, band limiting filtering at the ends of the spectrum is often employed) and giving the greatest weight to the loudest audio signal components. This approach takes advantage of a psychoacoustic phenomenon in which, at smaller time scales (20 milliseconds (msec) and less), the ear may tend to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed individually as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed. For example, if the input audio signal being processed is a solo instrument, the audio events that are identified will likely be the individual notes being played. Similarly, for an input voice signal, the individual components of speech, the vowels and consonants for example, will likely be identified as individual audio elements. As the complexity of the audio increases, such as music with a drumbeat or multiple instruments and voice, the auditory event detection identifies the most prominent (i.e., the loudest) audio element at any given moment. Alternatively, the “most prominent” audio element may be determined by taking hearing threshold and frequency response into consideration.

[0019] Optionally, according to further aspects of the present invention, at the expense of greater computational complexity, the process may also take into consideration changes in spectral composition with respect to time in discrete frequency bands (fixed or dynamically determined or both fixed and dynamically determined bands) rather than the full bandwidth. This alternative approach would take into account more than one audio stream in different frequency bands rather than assuming that only a single stream is perceptible at a particular time.

[0020] Even a simple and computationally efficient process according to an aspect of the present invention for segmenting audio has been found useful to identify auditory events.

[0021] An auditory event detecting process of the present invention may be implemented by dividing a time domain audio waveform into time intervals or blocks and then converting the data in each block to the frequency domain, using either a filter bank or a time-frequency transformation, such as a Discrete Fourier Transform (DFT) (implemented as a Fast Fourier Transform (FFT) for speed). The amplitude of the spectral content of each block may be normalized in order to eliminate or reduce the effect of amplitude changes. Each resulting frequency domain representation provides an indication of the spectral content (amplitude as a function of frequency) of the audio in the particular block. The spectral content of successive blocks is compared and each change greater than a threshold may be taken to indicate the temporal start or temporal end of an auditory event.

[0022] In order to minimize the computational complexity, only a single band of frequencies of the time domain audio waveform may be processed, preferably either the entire frequency band of the spectrum (which may be about 50 Hz to 15 kHz in the case of an average quality music system) or substantially the entire frequency band (for example, a band defining filter may exclude the high and low frequency extremes).

[0023] Preferably, the frequency domain data is normalized, as is described below. The degree to which the frequency domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, that too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together so that event boundaries resulting from either type of change are identified.

[0024] In practical embodiments in which the audio is represented by samples divided into blocks, each auditory event temporal start and stop point boundary necessarily coincides with a boundary of the blocks into which the time domain audio waveform is divided. There is a trade-off between real-time processing requirements (as larger blocks require less processing overhead) and resolution of event location (smaller blocks provide more detailed information on the location of auditory events).

[0025] As a further option, as suggested above, but at the expense of greater computational complexity, instead of processing the spectral content of the time domain waveform in a single band of frequencies, the spectrum of the time domain waveform prior to frequency domain conversion may be divided into two or more frequency bands. Each of the frequency bands may then be converted to the frequency domain and processed as though it were an independent channel. The resulting event boundaries may then be ORed together to define the event boundaries for that channel. The multiple frequency bands may be fixed, adaptive, or a combination of fixed and adaptive. Tracking filter techniques employed in audio noise reduction and other arts, for example, may be employed to define adaptive frequency bands (e.g., dominant simultaneous sine waves at 800 Hz and 2 kHz could result in two adaptively-determined bands centered on those two frequencies).
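
By way of illustration only, a minimal sketch of how boundaries detected independently in several such bands might be combined with a logical OR is given below; the function and array names are hypothetical and not part of the described process.

```python
# A minimal sketch, assuming each frequency band has already yielded a 0/1
# boundary array of the same length Q; names here are illustrative only.
import numpy as np

def combine_band_boundaries(per_band_boundaries):
    """Logical-OR the per-band event boundary arrays into one boundary array."""
    combined = np.zeros_like(per_band_boundaries[0], dtype=bool)
    for band in per_band_boundaries:
        combined |= band.astype(bool)      # a boundary in any band marks a boundary overall
    return combined.astype(int)

# Example: boundaries found separately in two adaptively determined bands
band_a = np.array([0, 1, 0, 0, 1, 0])      # e.g. a band centered near 800 Hz
band_b = np.array([0, 0, 0, 1, 1, 0])      # e.g. a band centered near 2 kHz
print(combine_band_boundaries([band_a, band_b]))   # [0 1 0 1 1 0]
```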

[0026] Other techniques for providing auditory scene analysis may be employed to identify auditory events in the present invention.

DESCRIPTION OF THE DRAWINGS

[0027] FIG. 1A is a flow chart showing the process of extraction of a signature from an audio signal in accordance with the present invention. The audio signal may, for example, represent music (e.g., a musical composition or “song”).

[0028] FIG. 1B is a flow chart illustrating a process for the time alignment of two audio signals in accordance with the present invention.

[0029] FIG. 2 is a flow chart showing the process of extraction of audio event locations and the optional extraction of dominant subbands from an audio signal in accordance with the present invention.

[0030] FIG. 3 is a conceptual schematic representation depicting the step of spectral analysis in accordance with the present invention.

[0031] FIGS. 4A and 4B are idealized audio waveforms showing a plurality of auditory event locations and auditory event boundaries in accordance with the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

[0032] In a practical embodiment of the invention, the audio signal is represented by samples that are processed in blocks of 512 samples, which corresponds to about 11.6 msec of input audio at a sampling rate of 44.1 kHz. A block length having a duration less than that of the shortest perceivable auditory event (about 20 msec) is desirable. It will be understood that the aspects of the invention are not limited to such a practical embodiment. The principles of the invention do not require arranging the audio into sample blocks prior to determining auditory events, nor, if blocks are used, do they require blocks of constant length. However, to minimize complexity, a fixed block length of 512 samples (or some other power-of-two number of samples) is useful for three primary reasons. First, it provides low enough latency to be acceptable for real-time processing applications. Second, it is a power-of-two number of samples, which is useful for fast Fourier transform (FFT) analysis. Third, it provides a suitably large window size to perform useful auditory scene analysis.

[0033] In the following discussions, the input signals are assumed to be data with amplitude values in the range [−1, +1].

[0034] Auditory Scene Analysis 2 (FIG. 1A)

[0035] Following audio input data blocking (not shown), the input audio signal is divided into auditory events, each of which tends to be perceived as separate, in process 2 (“Auditory Scene Analysis”) of FIG. 1A. Auditory scene analysis may be accomplished by an auditory scene analysis (ASA) process discussed above. Although one suitable process for performing auditory scene analysis is described in further detail below, the invention contemplates that other useful techniques for performing ASA may be employed.

[0036] FIG. 2 outlines a process in accordance with techniques of the present invention that may be used as the auditory scene analysis process of FIG. 1A. The ASA step or process 2 is composed of three general processing substeps. The first substep 2-1 (“Perform Spectral Analysis”) takes the audio signal, divides it into blocks and calculates a spectral profile or spectral content for each of the blocks. Spectral analysis transforms the audio signal into the short-term frequency domain. This can be performed using any filterbank, either based on transforms or on banks of band-pass filters, and in either linear or warped frequency space (such as the Bark scale or critical band, which better approximates the characteristics of the human ear). With any filterbank there exists a tradeoff between time and frequency. Greater time resolution, and hence shorter time intervals, leads to lower frequency resolution. Greater frequency resolution, and hence narrower subbands, leads to longer time intervals.

[0037] The first substep calculates the spectral content of successive time segments of the audio signal. In a practical embodiment, described below, the ASA block size is 512 samples of the input audio signal (FIG. 3). In the second substep 2-2, the differences in spectral content from block to block are determined (“Perform spectral profile difference measurements”). Thus, the second substep calculates the difference in spectral content between successive time segments of the audio signal. In the third substep 2-3 (“Identify location of auditory event boundaries”), when the spectral difference between one spectral-profile block and the next is greater than a threshold, the block boundary is taken to be an auditory event boundary. Thus, the third substep sets an auditory event boundary between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold. As discussed above, a powerful indicator of the beginning or end of a perceived auditory event is believed to be a change in spectral content. The locations of the event boundaries are stored as a signature. An optional process step 2-4 (“Identify dominant subband”) uses the spectral analysis to identify a dominant frequency subband that may also be stored as part of the signature.

[0038] In this embodiment, auditory event boundaries define auditory events having a length that is an integral multiple of spectral profile blocks with a minimum length of one spectral profile block (512 samples in this example). In principle, event boundaries need not be so limited.

[0039] Either overlapping or non-overlapping segments of the audio may be windowed and used to compute spectral profiles of the input audio. Overlap results in finer resolution as to the location of auditory events and also makes it less likely to miss an event, such as a transient. However, as time resolution increases, frequency resolution decreases. Overlap also increases computational complexity. Thus, overlap may be omitted. FIG. 3 shows a conceptual representation of non-overlapping 512 sample blocks being windowed and transformed into the frequency domain by the Discrete Fourier Transform (DFT). Each block may be windowed and transformed into the frequency domain, such as by using the DFT, preferably implemented as a Fast Fourier Transform (FFT) for speed.

[0040] The following variables may be used to compute the spectral profile of the input block:

N = number of samples in the input signal
M = number of windowed samples used to compute spectral profile
P = number of samples of spectral computation overlap
Q = number of spectral windows/regions computed

[0041] In general, any integer numbers may be used for the variables above. However, the implementation will be more efficient if M is set equal to a power of 2 so that standard FFTs may be used for the spectral profile calculations. In a practical embodiment of the auditory scene analysis process, the parameters listed may be set to:

M = 512 samples (or 11.6 msec at 44.1 kHz)
P = 0 samples (no overlap)

[0042] The above-listed values were determined experimentally and were found generally to identify with sufficient accuracy the location and duration of auditory events. However, setting the value of P to 256 samples (50% overlap) has been found to be useful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel or other suitable, preferably non-rectangular, window. The above-indicated values and a Hanning window type were selected after extensive experimental analysis as they have been shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for the processing of audio signals with predominantly low frequency content. Rectangular windowing produces spectral artifacts that may cause incorrect detection of events. Unlike certain codec applications where an overall overlap/add process must provide a constant level, such a constraint does not apply here and the window may be chosen for characteristics such as its time/frequency resolution and stop-band rejection.

[0043] In substep 2-1 (FIG. 2), the spectrum of each M-sample block may be computed by windowing the data with an M-point Hanning, Kaiser-Bessel or other suitable window, converting to the frequency domain using an M-point Fast Fourier Transform, and calculating the magnitude of the FFT coefficients. The resultant data is normalized so that the largest magnitude is set to unity, and the normalized array of M numbers is converted to the log domain. The array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in substep 2-2. Furthermore, the log domain more closely matches the log domain amplitude nature of the human auditory system. The resulting log domain values have a range of minus infinity to zero. In a practical embodiment, a lower limit can be imposed on the range of values; the limit may be fixed, for example −60 dB, or be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that it would be possible to reduce the size of the array to M/2 in that the FFT represents negative as well as positive frequencies.)
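
A minimal sketch of this substep is given below, assuming NumPy, the M = 512 Hanning-window parameters given above, and a fixed −60 dB floor; the function name and details are illustrative rather than a definitive implementation.

```python
# A sketch of substep 2-1 under the assumptions stated above (illustrative only).
import numpy as np

def spectral_profile(block, floor_db=-60.0):
    """Normalized log-magnitude (dB) spectrum of one M-sample block."""
    M = len(block)
    windowed = block * np.hanning(M)              # M-point Hanning window
    magnitude = np.abs(np.fft.fft(windowed))      # magnitude of the M FFT coefficients
    magnitude /= max(magnitude.max(), 1e-12)      # normalize largest magnitude to unity
    log_mag = 20.0 * np.log10(np.maximum(magnitude, 1e-12))  # convert to the log (dB) domain
    return np.maximum(log_mag, floor_db)          # impose the fixed lower limit
```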

[0044] Substep 2-2 calculates a measure of the difference between the spectra of adjacent blocks. For each block, each of the M (log) spectral coefficients from substep 2-1 is subtracted from the corresponding coefficient for the preceding block, and the magnitude of the difference calculated (the sign is ignored). These M differences are then summed to one number. Hence, for the whole audio signal, the result is an array of Q positive numbers; the greater the number, the more a block differs in spectrum from the preceding block. This difference measure could also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case M coefficients).
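
The following sketch of substep 2-2 assumes the per-block profiles produced above; it simply sums the absolute coefficient differences between neighbouring blocks (the names are illustrative).

```python
# A sketch of substep 2-2, assuming `profiles` is a list of equal-length
# log-spectral arrays, one per block (illustrative only).
import numpy as np

def profile_differences(profiles):
    """One non-negative difference value per block (zero for the first block)."""
    diffs = [0.0]                                    # the first block has no predecessor
    for prev, cur in zip(profiles[:-1], profiles[1:]):
        diffs.append(float(np.sum(np.abs(cur - prev))))   # magnitudes summed to one number
    return np.array(diffs)                           # array of Q values for the whole signal
```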

[0045] Substep 2-3 identifies the locations of auditory event boundaries by applying a threshold to the array of difference measures from substep 2-2. When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event and the block number of the change is recorded as an event boundary. For the values of M and P given above and for log domain values (in substep 2-1) expressed in units of dB, the threshold may be set equal to 2500 if the whole magnitude FFT (including the mirrored part) is compared, or 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies—for the magnitude of the FFT, one half is the mirror image of the other). This value was chosen experimentally and it provides good auditory event boundary detection. This parameter value may be changed to reduce (increase the threshold) or increase (decrease the threshold) the detection of events.
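
A sketch of substep 2-3 under the same assumptions (whole magnitude FFT, dB-domain values, threshold 2500) follows; the example numbers are made up to mirror FIG. 4A.

```python
# A sketch of substep 2-3 (illustrative only); threshold per the text above.
import numpy as np

def find_event_boundaries(diffs, threshold=2500.0):
    """Return the block numbers whose spectral change exceeds the threshold."""
    return [q for q, d in enumerate(diffs) if d > threshold]

# Made-up example: with 512-sample blocks, boundary blocks 2 and 3 correspond
# to auditory event boundaries at samples 1024 and 1536 (compare FIG. 4A).
print(find_event_boundaries(np.array([0.0, 900.0, 2700.0, 3100.0, 600.0])))   # [2, 3]
```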

[0046] The details of this practical embodiment are not critical. Other ways to calculate the spectral content of successive time segments of the audio signal, calculate the differences between successive time segments, and set auditory event boundaries at the respective boundaries between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold may be employed.

[0047] For an audio signal consisting of Q blocks (of size M samples), the output of the auditory scene analysis process of function 2 of FIG. 1A is an array B(q) of information representing the location of auditory event boundaries, where q = 0, 1, . . . , Q−1. For a block size of M = 512 samples, overlap of P = 0 samples and a signal-sampling rate of 44.1 kHz, the auditory scene analysis function 2 outputs approximately 86 values a second. Preferably, the array B(q) is stored as the signature, such that, in its basic form, without the optional dominant subband frequency information, the audio signal's signature is an array B(q) representing a string of auditory event boundaries.
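
The quoted rate follows directly from the sampling rate divided by the block advance of M−P samples:

$\frac{44100\ \mathrm{samples/s}}{M-P} = \frac{44100}{512-0} \approx 86.1\ \mathrm{values\ per\ second}$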

[0048] An example of the results of auditory scene analysis for two different signals is shown in FIGS. 4A and 4B. The top plot, FIG. 4A, shows the results of auditory scene processing where auditory event boundaries have been identified at samples 1024 and 1536. The bottom plot, FIG. 4B, shows the identification of event boundaries at samples 1024, 2048 and 3072.

[0049] Identify Dominant Subband (Optional)

[0050] For each block, an optional additional step in the ASA processing (shown in FIG. 2) is to extract information from the audio signal denoting the dominant frequency “subband” of the block (conversion of the data in each block to the frequency domain results in information divided into frequency subbands). This block-based information may be converted to auditory-event-based information, so that the dominant frequency subband is identified for every auditory event. This information for every auditory event provides the correlation processing (described below) with further information in addition to the auditory event boundary information. The dominant (largest amplitude) subband may be chosen from a plurality of subbands, three or four, for example, that are within the range or band of frequencies where the human ear is most sensitive. Alternatively, other criteria may be used to select the subbands.

[0051] The spectrum may be divided, for example, into three subbands. The preferred frequency ranges of the subbands are:

Subband 1: 301 Hz to 560 Hz
Subband 2: 560 Hz to 1938 Hz
Subband 3: 1938 Hz to 9948 Hz

[0052] To determine the dominant subband, the square of the magnitude spectrum (or the power magnitude spectrum) is summed for each subband. The resulting sum for each subband is calculated and the largest is chosen. The subbands may also be weighted prior to selecting the largest. The weighting may take the form of dividing the sum for each subband by the number of spectral values in the subband, or alternatively may take the form of an addition or multiplication to emphasize the importance of one band over another. This can be useful where some subbands have more energy on average than other subbands but are less perceptually important.
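
A minimal sketch of the unweighted selection is given below, assuming a linear-magnitude spectrum for one block, a 44.1 kHz sampling rate and the three subband ranges listed above; the bin mapping and names are assumptions.

```python
# A sketch of the dominant-subband choice (illustrative; no weighting applied).
import numpy as np

def dominant_subband(magnitude_spectrum, fs=44100):
    """Return 1, 2 or 3: the subband with the largest summed power for this block."""
    M = len(magnitude_spectrum)                    # e.g. 512 FFT bins
    bin_hz = fs / M                                # width of one FFT bin in Hz
    subbands_hz = [(301, 560), (560, 1938), (1938, 9948)]
    powers = []
    for lo, hi in subbands_hz:
        lo_bin, hi_bin = int(lo / bin_hz), int(hi / bin_hz)
        powers.append(np.sum(magnitude_spectrum[lo_bin:hi_bin] ** 2))   # power per subband
    return int(np.argmax(powers)) + 1              # subbands numbered 1..3
```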

[0053] Considering an audio signal consisting of Q blocks, the output of the dominant subband processing is an array DS(q) of information representing the dominant subband in each block (q = 0, 1, . . . , Q−1). Preferably, the array DS(q) is stored in the signature along with the array B(q). Thus, with the optional dominant subband information, the audio signal's signature is two arrays B(q) and DS(q), representing, respectively, a string of auditory event boundaries and a dominant frequency subband within each block. Thus, in an idealized example, the two arrays could have the following values (for a case in which there are three possible dominant subbands):

1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0   (Event Boundaries)
1 1 2 2 2 2 1 1 1 3 3 3 3 3 3 1 1   (Dominant Subbands)

[0054] In most cases, the dominant subband remains the same within each auditory event, as shown in this example, or has an average value if it is not uniform for all blocks within the event. Thus, a dominant subband may be determined for each auditory event and the array DS(q) may be modified to provide that the same dominant subband is assigned to each block within an event.
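
As one possible reading of this step, the sketch below assigns the most common per-block subband within each event to every block of that event (using the most common value rather than an average is an assumption; the array conventions follow the idealized example above).

```python
# A sketch of the optional DS(q) adjustment (illustrative only).
from collections import Counter

def smooth_dominant_subbands(B, DS):
    """B: 0/1 event-boundary list; DS: per-block dominant subband list, both length Q."""
    smoothed, start = list(DS), 0
    for q in range(1, len(B) + 1):
        if q == len(B) or B[q] == 1:               # an event ends just before the next boundary
            most_common = Counter(DS[start:q]).most_common(1)[0][0]
            smoothed[start:q] = [most_common] * (q - start)
            start = q
    return smoothed
```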

[0055] Time Offset Calculation

[0056] The output of the Signature Extraction (FIG. 1A) is one or more arrays of auditory scene analysis information that are stored as a signature, as described above. The Time Offset Calculation function (FIG. 1B) takes two signatures and calculates a measure of their time offset. This is performed using known cross-correlation methods.

[0057] Let S₁ (length Q₁) be an array from Signature 1 and S₂ (length Q₂) an array from Signature 2. First, calculate the cross-correlation array $R_{E_1E_2}$ (see, for example, John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Macmillan Publishing Company, 1992, ISBN 0-02-396815-X).

$R_{E_1E_2}(l) = \sum_{q=-\infty}^{\infty} S_1(q) \cdot S_2(q-l), \qquad l = 0, \pm 1, \pm 2, \ldots \qquad (1)$

[0058] In a practical embodiment, the cross-correlation is performed using standard FFT-based techniques to reduce execution time.

[0059] Since both S₁ and S₂ are finite in length, the non-zero component of $R_{E_1E_2}$ has a length of Q₁+Q₂−1. The lag l corresponding to the maximum element in $R_{E_1E_2}$ represents the time offset of S₂ relative to S₁:

$l_{peak} = l \ \text{for} \ \max\left(R_{E_1E_2}(l)\right) \qquad (2)$

[0060] This offset has the same units as the signature arrays S₁ and S₂. In a practical implementation, the elements of S₁ and S₂ have an update rate equivalent to the audio block size used to generate the arrays minus the overlap of adjacent blocks: that is, M−P = 512−0 = 512 samples. Therefore the offset has units of 512 audio samples.
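
A minimal sketch of equations (1) and (2), together with the conversion to audio samples, is given below, assuming the signatures are numeric arrays such as B(q); NumPy's direct cross-correlation stands in here for the FFT-based technique mentioned above.

```python
# A sketch of the offset calculation (illustrative only).
import numpy as np

def signature_offset(S1, S2, M=512, P=0):
    """Return (l_peak, offset_in_audio_samples) of S2 relative to S1."""
    R = np.correlate(np.asarray(S1, float), np.asarray(S2, float), mode="full")
    l_peak = int(np.argmax(R)) - (len(S2) - 1)     # lag of the maximum element of R
    return l_peak, l_peak * (M - P)                # each signature element spans M-P samples
```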

[0061] Time Alignment

[0062] The Time Alignment function 6 (FIG. 1B) uses the calculated offset to time align the two audio signals. It takes as inputs Audio Signals 1 and 2 (used to generate the two signatures) and offsets one in relation to the other such that they are both more closely aligned in time. The two aligned signals are output as Audio Signals 3 and 4. The amount of delay or offset applied is the product of the relative signature delay l_peak between signatures S₂ and S₁ and the resolution, M−P samples, of the signatures.

[0063] For applications where only the passage common to the two sources is of interest (as in the case of watermark detection where unmarked and marked signals are to be directly compared), the two sources may be truncated to retain only that common passage.

[0064] For applications where no information is to be lost, one signal may be offset by the insertion of leading samples. For example, let x₁(n) be the samples of Audio Signal 1 with a length of N₁ samples, and x₂(n) be the samples of Audio Signal 2 with a length of N₂ samples. Also, l_peak represents the offset of S₂ relative to S₁ in units of M−P audio samples.

[0065] The sample offset D₂₁ of Audio Signal 2 relative to Audio Signal 1 is the product of the signature offset l_peak and M−P:

$D_{21} = l_{peak} \cdot (M-P) \qquad (3)$

[0066] If D₂₁ is zero, then both input signals are output unmodified as Signals 3 and 4 (see FIG. 1B). If D₂₁ is positive, then input signal x₁(n) is modified by inserting leading samples:

$x_1^{\prime}(m) = \begin{cases} 0 & 0 \leq m < D_{21} \\ x_1(n) & m = n + D_{21}, \; 0 \leq n < N_1 \end{cases} \qquad (4)$

[0067] Signals x₁′(m) and x₂(n) are then output as Signals 3 and 4 (see FIG. 1B). If D₂₁ is negative, then input signal x₂(n) is modified by inserting leading samples:

$x_2^{\prime}(m) = \begin{cases} 0 & 0 \leq m < -D_{21} \\ x_2(n) & m = n - D_{21}, \; 0 \leq n < N_2 \end{cases} \qquad (5)$
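
A sketch of equations (4) and (5) follows, assuming D₂₁ has already been computed as in equation (3); the zero samples are simply prepended to whichever signal lags.

```python
# A sketch of the Time Alignment step (illustrative only).
import numpy as np

def time_align(x1, x2, D21):
    """Insert leading zero samples so the two signals are nominally time aligned."""
    if D21 > 0:                                    # equation (4): delay Audio Signal 1
        x1 = np.concatenate([np.zeros(D21), x1])
    elif D21 < 0:                                  # equation (5): delay Audio Signal 2
        x2 = np.concatenate([np.zeros(-D21), x2])
    return x1, x2                                  # output as Audio Signals 3 and 4
```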

[0068] Computational Complexity and Accuracy

[0069] The computational power required to calculate the offset is proportional to the lengths of the signature arrays, Q₁ and Q₂. Because the process described has some offset error, the time alignment process of the present invention may be followed by a conventional process having a finer resolution that works directly with the audio signals rather than the signatures. For example, such a process may take sections of the aligned audio signals (slightly longer than the offset error to ensure some overlap) and cross-correlate the sections directly to determine the exact sample error or fine offset.
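
By way of illustration, such a follow-up refinement (which, as noted, forms no part of the described process) might cross-correlate short sections of the coarsely aligned audio directly; the section length and names here are arbitrary assumptions.

```python
# A sketch of a conventional fine-offset refinement (not part of the invention).
import numpy as np

def fine_offset(aligned1, aligned2, start, length=2048):
    """Residual lag, in samples, between two corresponding aligned sections."""
    a = np.asarray(aligned1[start:start + length], float)
    b = np.asarray(aligned2[start:start + length], float)
    R = np.correlate(a, b, mode="full")
    return int(np.argmax(R)) - (len(b) - 1)        # small +/- sample error to correct
```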

[0070] Since the signature arrays are used to calculate the sample offset, the accuracy of the time alignment method is limited to the audio block size used to generate the signatures: in this implementation, 512 samples. In other words, this method will have an error in the sample offset of approximately plus/minus half the block size: in this implementation, ±256 samples.

[0071] This error can be reduced by increasing the resolution of the signatures; however, there exists a tradeoff between accuracy and computational complexity. Lower offset error requires finer resolution in the signature arrays (more array elements), and this requires higher processing power in computing the cross-correlation. Higher offset error permits coarser resolution in the signature arrays (fewer array elements), and this requires lower processing power in computing the cross-correlation.

[0072] Applications

[0073] Watermarking involves embedding information in a signal by altering the signal in some predefined way, including the addition of other signals, to create a marked signal. The detection or extraction of embedded information often relies on a comparison of the marked signal with the original source. Also, the marked signal often undergoes other processing, including audio coding and speaker/microphone acoustic path transmission. The present invention provides a way of time aligning a marked signal with the original source to then facilitate the extraction of embedded information.

[0074] Subjective and objective methods for determining audio coder quality compare a coded signal with the original source used to generate the coded signal, in order to create a measure of the signal degradation (for example, an ITU-R 5-point impairment score). The comparison relies on time alignment of the coded audio signal with the original source signal. This method provides a means of time aligning the source and coded signals.

[0075] Other applications of the invention are possible, for example, improving the lip-syncing of audio and video signals, as mentioned above.

[0076] It should be understood that implementation of other variations and modifications of the invention and its various aspects will be apparent to those skilled in the art, and that the invention is not limited by the specific embodiments described. It is therefore contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.

[0077] The present invention and its various aspects may be implemented as software functions performed in digital signal processors, programmed general-purpose digital computers, and/or special purpose digital computers. Interfaces between analog and digital signal streams may be performed in appropriate hardware and/or as functions in software and/or firmware.

1. A method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived from another signal, comprising deriving reduced-information characterizations of said audio signals, wherein said reduced-information characterizations are based on auditory scene analysis, calculating the time offset of one characterization with respect to the other characterization, and modifying the temporal relationship of said audio signals with respect to each other in response to said time offset such that said audio signals are substantially coincident with each other.
2. The method of claim 1 wherein said reduced-information characterizations are derived from said audio signals and embedded in respective other signals that are carried with the audio signals from which they were derived prior to said calculating and modifying.
3. The method of claim 2 wherein said other signals are the video portions of television signals and said audio signals are the audio portions of the respective television signals.
4. A method for time aligning an audio signal and another signal, comprising deriving a reduced-information characterization of the audio signal and embedding said characterization in the other signal when the audio signal and other signal are substantially in synchronism, wherein said characterization is based on auditory scene analysis, recovering the embedded characterization of said audio signal from said other signal and deriving a reduced-information characterization of said audio signal from said audio signal in the same way the embedded characterization of the audio signal was derived based on auditory scene analysis, after said audio signal and said other signal have been subjected to differential time offsets, calculating the time offset of one characterization with respect to the other characterization, and modifying the temporal relationship of the audio signal with respect to the other signal in response to said time offset such that the audio signal and the other signal are substantially in synchronism with each other.
5. The method of claim 4 wherein said other signal is a video signal.
6. The method of claim 1 or claim 4 wherein calculating the time offset includes performing a cross-correlation of said characterizations.
7. The method of any one of claims 1-6 wherein said reduced-information characterizations based on auditory scene analysis are arrays of information representing at least the location of auditory event boundaries.
8. The method of claim 7 wherein said auditory event boundaries are determined by calculating the spectral content of successive time segments of said audio signal, calculating the difference in spectral content between successive time segments of said audio signal, and identifying an auditory event boundary as the boundary between successive time segments when the difference in the spectral content between such successive time segments exceeds a threshold.
9. The method of claim 7 or claim 8 wherein said arrays of information also represent the dominant frequency subband of each of said auditory events.